The Extensible Markup Language (XML) is a complex language, and consequently, XML-based protocols are susceptible to entire classes of implicit and explicit security problems. Message formats in XML-based protocols are usually specified in XML Schema, and as a first-line defense, schema validation should reject malformed input. However, extension points in most protocol specifications break validation. Extension points are wildcards and considered best practice for loose composition, but they also enable an attacker to add unchecked content in a document, e.g., for a signature wrapping attack. This paper introduces datatyped XML visibly pushdown automata (dXVPAs) as a language representation for mixed-content XML and presents an incremental learner that infers a dXVPA from example documents. The learner generalizes XML types and datatypes in terms of automaton states and transitions, and an inferred dXVPA converges to a good-enough approximation of the true language. The automaton is free from extension points and capable of stream validation, e.g., as an anomaly detector for XML-based protocols. For dealing with adversarial training data, two scenarios of poisoning are considered: a poisoning attack is either uncovered at a later time or remains hidden. Unlearning can therefore remove an identified poisoning attack from a dXVPA, and sanitization trims low-frequency states and transitions to get rid of hidden attacks. All algorithms have been evaluated in four scenarios, including a web service implemented in Apache Axis2 and Apache Rampart, where attacks have been simulated. In all scenarios, the learned automaton had zero false positives and outperformed traditional schema validation.
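The sanitization idea can be illustrated with a minimal sketch. It assumes a plain finite automaton whose transitions carry usage counters, which is a deliberate simplification and not the dXVPA algorithm of the paper. The function name `sanitize`, the threshold `min_count`, and the example states, symbols, and counts are all hypothetical.

```python
from collections import deque


def sanitize(transitions, start, min_count):
    """Trim rarely used transitions, then drop states that became unreachable.

    transitions maps (source, symbol, target) to the number of times the
    transition was used during learning (hypothetical representation).
    """
    # Step 1: keep only transitions used at least min_count times.
    kept = {edge: n for edge, n in transitions.items() if n >= min_count}

    # Step 2: breadth-first search from the start state over kept edges.
    successors = {}
    for source, _symbol, target in kept:
        successors.setdefault(source, set()).add(target)
    reachable = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in successors.get(state, ()):
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)

    # Step 3: discard transitions that leave an unreachable state.
    return {edge: n for edge, n in kept.items() if edge[0] in reachable}


# Hypothetical counters: the <evil> branch was seen once (a hidden poisoning attempt).
transitions = {
    ("q0", "<order>", "q1"): 120,
    ("q1", "<item>", "q2"): 118,
    ("q1", "<evil>", "q3"): 1,
    ("q3", "</evil>", "q1"): 1,
}

cleaned = sanitize(transitions, start="q0", min_count=2)
```

In this sketch, the threshold trades robustness against coverage: a higher `min_count` removes more potential poisoning but may also trim legitimate, rarely seen content.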